
[BugFix] Ensure num_cached_tokens is non-negative for kv transfer failed requests #37354

Closed
KingsleyZhang123 wants to merge 1 commit into vllm-project:main from KingsleyZhang123:Failed-kv-transfer-request-handling

Conversation

KingsleyZhang123 (Contributor) commented Mar 17, 2026

Purpose

Bug fix for handling requests whose KV load fails.
For a request that fails KV load on the decode side, the request is still in the WAITING_REMOTE_KV state, so its num_cached_tokens is still the default -1 and never gets updated. When we log the local_cache_hit metric, that -1 is used and crashes the engine with:

ValueError: Counters can only be incremented by non-negative amounts.
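
A minimal sketch (not the vLLM metrics code) of why this crashes, assuming a plain prometheus_client Counter and an illustrative metric name:

    # Not the actual vLLM metrics path: a minimal sketch with a plain
    # prometheus_client Counter; the metric name here is illustrative.
    from prometheus_client import Counter

    local_cache_hit = Counter(
        "local_cache_hit_tokens", "Tokens served from the local prefix cache"
    )

    num_cached_tokens = -1  # default for a request that never left WAITING_REMOTE_KV

    # local_cache_hit.inc(num_cached_tokens)
    #   -> ValueError: Counters can only be incremented by non-negative amounts.

    local_cache_hit.inc(max(0, num_cached_tokens))  # clamping avoids the crash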

Test Plan

Ran stress testing and intentionally failed random KV transfers to surface the bug.

Test Result

Before:

The logger crashes with ValueError: Counters can only be incremented by non-negative amounts.

After:

No crash observed


gemini-code-assist (Bot) left a comment


Code Review

This pull request addresses a ValueError that occurs when logging metrics for requests that fail KV transfer. The num_cached_tokens for such requests can remain at its default value of -1, which causes a crash when used to increment a counter. The fix applies max(0, ...) to ensure num_cached_tokens is always non-negative when creating the EngineCoreOutput for finished requests. This is a direct and effective solution to the problem. The change is correct and I have no further suggestions.
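
As a rough illustration of the shape of that fix (a simplified stand-in, not the actual EngineCoreOutput class or its fields):

    # Simplified, illustrative stand-in for EngineCoreOutput; the class and
    # field names here are not the actual vLLM API.
    from dataclasses import dataclass

    @dataclass
    class FinishedOutputSketch:
        request_id: str
        num_cached_tokens: int

    def build_finished_output(request_id: str, num_cached_tokens: int) -> FinishedOutputSketch:
        # A request that fails KV load while still in WAITING_REMOTE_KV carries
        # the default -1 here; clamping keeps downstream metric increments valid.
        return FinishedOutputSketch(
            request_id=request_id,
            num_cached_tokens=max(0, num_cached_tokens),
        )

    print(build_finished_output("req-0", -1))  # num_cached_tokens=0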

markmc (Member) left a comment


Hold because we're working through this in #36859

markmc (Member) commented Mar 18, 2026

If you have clear instructions for reproducing this issue, that would be helpful. Thanks

KingsleyZhang123 (Contributor, Author) commented:

> Hold because we're working through this in #36859

OK, I think this commit works as well.
To reproduce, use a P/D disaggregated KV connector with async_kv_load set to True and manually fail the KV transfer, so that the request is marked as failed while it is still in the WAITING_REMOTE_KV state, where its num_cached_tokens is still -1.

chenminghua8 commented:

> If you have clear instructions for reproducing this issue, that would be helpful. Thanks

Use an A100 80G to start vLLM:

    LMCACHE_LOG_LEVEL=ERROR \
    LMCACHE_LOCAL_CPU=True \
    LMCACHE_MAX_LOCAL_CPU_SIZE=35 \
    LMCACHE_CHUNK_SIZE=256 \
    LMCACHE_USE_EXPERIMENTAL=True \
    PYTHONHASHSEED=0 \
    CUDA_VISIBLE_DEVICES=1 \
    vllm serve /data/model/Qwen2.5-32B-Instruct \
        --served-model-name Qwen2.5-32B \
        --max-model-len=20000 \
        --gpu-memory-utilization 0.9 \
        --tensor-parallel-size 1 \
        --max-num-batched-tokens 4096 \
        --load-format dummy \
        --port 8500 \
        --enable-prefix-caching \
        --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'

Multiple rounds of session testing:

    python3 LMCache/benchmarks/multi_round_qa/multi-round-qa.py \
        --num-users 120 \
        --num-rounds 5 \
        --qps 6 \
        --shared-system-prompt 0 \
        --user-history-prompt 1024 \
        --answer-len 200 \
        --model Qwen2.5-32B \
        --base-url http://localhost:8500/v1

markmc moved this from Backlog to In Review in Metrics & Tracing on Apr 8, 2026
markmc (Member) commented Apr 8, 2026

The issue has been fixed on main since #37160 introduced this band-aid:

        self.local_cache_hit += max(
            0, (num_cached_tokens + recomputed - num_external_computed_tokens))

#37460 is the current candidate for a long-term fix

markmc closed this on Apr 8, 2026
markmc moved this from In Review to Not planned in Metrics & Tracing on Apr 8, 2026

Labels

bug (Something isn't working), v1

Projects

Status: Not planned


4 participants